learning model can be constructed based on the feature space as
elow, where Ω is a subspace to be generated using a cluster
model or a classification analysis model f(X),
݂ሺ܆ሻ⟹⋃Ω
(7.15)
understood that different species will reserve different sequence
in their genomes. A nucleotide sequence or a protein sequence is
assumed to be a string of genetic recombination of the nucleic
he amino acids. Therefore they will form different distributions
nce statistics. In natural language processing, a library of basic
nstitutes the basis for the classification of articles or books. A
e sequence or a protein sequence of a species can be treated as an
a book. It should contain some basic sequence statistics pattern
w short sub-sequences of the nucleic acids or the amino acids are
d within a genome or a whole sequence. Therefore, extracting
ence statistics to generate the frequency or library of sub-
s from a sequence can be used to compare or classify sequences,
ecies.
equence statistics mainly represent the distribution of words,
e consecutive sub-strings within a sequence string. These sub-
r words are called k-mers in the application to sequence
on. For instance, any nucleotide sequence is a chain or a string of
eic acids, i.e., ܛ∈ሺܣ, ܥ, ܩ, ܶሻℓ. In case k = 1 when using the
pproach to represent sequences, ܠ∈࣬ସ. There are 16 2-mer
ch as AA, AC, AG, AT, CA, etc. Therefore, ܛ⟹ܠ∈࣬ଵ if
re used to represent nucleotide sequences. There are 64 3-mer
ch as AAA, AAC, etc in a nucleotide sequence. Therefore, ܛ⟹
ସ if 3-mers are used to represent nucleotide sequences.
ollection of all words for a specific size k is called a word set.
the times each word occurs in a sequence thus builds up a
statistics library or the word library. Similar sequences are
to have similar sequence statistics patterns.